OIDA-QA: A Multimodal Benchmark for Analyzing the Opioid Industry Documents Archive
Shen, Xuan, Wingenroth, Brian, Wang, Zichao, Kuen, Jason, Zhu, Wanrong, Zhang, Ruiyi, Wang, Yiwei, Ma, Lichun, Liu, Anqi, Liu, Hongfu, Sun, Tong, Hawkins, Kevin S., Tasker, Kate, Alexander, G. Caleb, Gu, Jiuxiang
The opioid crisis represents a significant moment in public health that reveals systemic shortcomings across regulatory systems, healthcare practices, corporate governance, and public policy. Analyzing how these interconnected systems simultaneously failed to protect public health requires innovative analytic approaches for exploring the vast amounts of data and documents disclosed in the UCSF-JHU Opioid Industry Documents Archive (OIDA). The complexity, multimodal nature, and specialized characteristics of these healthcare-related legal and corporate documents necessitate more advanced methods and models tailored to specific data types and detailed annotations, ensuring precision and professionalism in the analysis. In this paper, we tackle this challenge by organizing the original dataset according to document attributes and constructing a benchmark with 400k training documents and 10k testing documents. From each document, we extract rich multimodal information, including textual content, visual elements, and layout structures, to capture a comprehensive range of features. Using multiple AI models, we then generate a large-scale dataset comprising 360k training QA pairs and 10k testing QA pairs. Building on this foundation, we develop domain-specific multimodal Large Language Models (LLMs) and explore the impact of multimodal inputs on task performance. To further enhance response accuracy, we incorporate historical QA pairs as contextual grounding for answering current queries. Additionally, we embed page references within the answers and introduce an importance-based page classifier, further improving the precision and relevance of the information provided. Preliminary results demonstrate the improvements our AI assistant brings to document information extraction and question-answering tasks. The dataset is available at: https://huggingface.co/datasets/opioidarchive/oida-qa
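As a rough illustration of the two answer-grounding ideas in this abstract (historical QA pairs as context, plus an importance-based page classifier), here is a minimal Python sketch. All helper names and the prompt layout are assumptions for illustration, not the paper's actual models or prompt format.

```python
# Sketch: ground the current query in prior QA turns and the pages an
# importance classifier ranked highest. Names here are illustrative
# assumptions; the paper's actual implementation is not reproduced.
from typing import List, Tuple

def build_prompt(history: List[Tuple[str, str]],
                 page_texts: List[str],
                 page_scores: List[float],
                 question: str,
                 max_pages: int = 3) -> str:
    """Compose a prompt from prior QA turns plus the most important pages."""
    # Keep only the pages the (assumed) importance classifier scored highest.
    ranked = sorted(zip(page_scores, range(len(page_texts))), reverse=True)
    kept = sorted(idx for _, idx in ranked[:max_pages])

    lines = ["Context from earlier questions and answers:"]
    for q, a in history:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append("Relevant document pages:")
    for idx in kept:
        lines.append(f"[page {idx + 1}] {page_texts[idx]}")
    lines.append(f"Current question: {question}")
    lines.append("Answer, citing page numbers like [page N]:")
    return "\n\n".join(lines)
```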
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > California > Merced County > Merced (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
BudgetMem: Learning Selective Memory Policies for Cost-Efficient Long-Context Processing in Language Models
Alla, Chandra Vamsi Krishna, Gaddam, Harish Naidu, Kommi, Manohar
Large Language Models (LLMs) face significant computational and memory constraints when processing long contexts, despite growing demand for applications requiring reasoning over extensive documents, multi-session dialogues, and book-length texts. While recent advances have extended context windows to 100K-1M tokens, such approaches incur prohibitive costs for resource-constrained deployments. We propose BudgetMem, a novel memory-augmented architecture that learns what to remember rather than remembering everything. Our system combines selective memory policies with feature-based salience scoring (entity density, TF-IDF, discourse markers, position bias) to decide which information merits storage under strict budget constraints. Unlike existing retrieval-augmented generation (RAG) systems that store all chunks, BudgetMem employs learned gating mechanisms coupled with BM25 sparse retrieval for efficient information access. Through comprehensive experiments on 700 question-answer pairs across short (237 tokens) and long (5K-10K tokens) documents with Llama-3.2-3B-Instruct, we demonstrate that BudgetMem achieves remarkable results on long documents: only 1.0% F1 score degradation while saving 72.4% memory compared to baseline RAG. We validate our approach through budget sensitivity analysis (testing 7 budget ratios), naive baseline comparisons, and document-length analysis, showing that BudgetMem's benefits increase with document length. Our work provides a practical pathway for deploying capable long-context systems on modest hardware, democratizing access to advanced language understanding capabilities.
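A minimal sketch of budget-constrained selection in the spirit of BudgetMem follows. The feature weights and heuristics below are illustrative stand-ins for the paper's learned gating policy, not its actual scoring model.

```python
# Sketch: score chunks on the salience features the abstract lists
# (entity density, TF-IDF, discourse markers, position bias), then keep
# only the top fraction allowed by the memory budget.
import math
import re
from collections import Counter
from typing import List

DISCOURSE = {"however", "therefore", "because", "in conclusion", "thus"}

def salience(chunk: str, position: int, n_chunks: int,
             df: Counter, n_docs: int) -> float:
    words = re.findall(r"[a-z']+", chunk.lower())
    if not words:
        return 0.0
    # Entity-density proxy: share of capitalized tokens.
    caps = sum(1 for w in re.findall(r"\b\w+", chunk) if w[0].isupper())
    entity_density = caps / len(words)
    # Mean TF-IDF over the chunk's vocabulary.
    tf = Counter(words)
    tfidf = sum((c / len(words)) * math.log((1 + n_docs) / (1 + df[w]))
                for w, c in tf.items()) / len(tf)
    markers = sum(chunk.lower().count(m) for m in DISCOURSE)
    position_bias = 1.0 - position / max(n_chunks - 1, 1)  # favor early text
    # Illustrative weights; the paper learns this combination.
    return 0.4 * tfidf + 0.3 * entity_density + 0.2 * position_bias + 0.1 * markers

def select_chunks(chunks: List[str], budget_ratio: float = 0.3) -> List[str]:
    df = Counter(w for c in chunks
                 for w in set(re.findall(r"[a-z']+", c.lower())))
    scores = [salience(c, i, len(chunks), df, len(chunks))
              for i, c in enumerate(chunks)]
    k = max(1, int(budget_ratio * len(chunks)))  # strict storage budget
    keep = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in sorted(keep)]  # preserve document order
```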
- Asia > Singapore (0.05)
- Asia > Middle East > Jordan (0.04)
Beyond Token Limits: Assessing Language Model Performance on Long Text Classification
Sebők, Miklós, Kovács, Viktor, Bánóczy, Martin, Eriksen, Daniel Møller, Neptune, Nathalie, Roussille, Philippe
The most widely used large language models in the social sciences (such as BERT and its derivatives, e.g., RoBERTa) are limited in the length of the input text they can process to produce predictions. This is a particularly pressing issue for classification tasks that must handle long input texts. One such area deals with laws and draft laws (bills), which can run to several hundred pages and are therefore not amenable to processing with models that can only handle, e.g., 512 tokens. In this paper, we report results from experiments covering five languages with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models on the multiclass classification task of the Comparative Agendas Project, which has a codebook of 21 policy topic labels ranging from education to health care. The results show no particular advantage for the Longformer model, despite its pre-training specifically for handling long inputs. A comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of support and substance overlaps between specific categories when it comes to performance on long text inputs.
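For readers unfamiliar with the 512-token ceiling this abstract refers to, the snippet below shows how it manifests with a standard Hugging Face tokenizer (an assumed but typical setup for these models): everything past `max_length` is silently dropped, so a classifier only ever sees the opening of a long bill.

```python
# Illustration of the 512-token limit, assuming the standard Hugging Face
# transformers setup for XLM-RoBERTa.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
bill_text = "..."  # a multi-hundred-page bill, far beyond 512 tokens

# truncation=True discards everything after max_length.
encoded = tokenizer(bill_text, truncation=True, max_length=512,
                    return_tensors="pt")
print(encoded["input_ids"].shape)  # at most (1, 512), regardless of length
```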
- Africa (0.14)
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (6 more...)
- Law (1.00)
- Government > Regional Government (0.93)
- Law Enforcement & Public Safety (0.68)
MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction
Zhang, Liting, Zhao, Shiwan, Kong, Aobo, Li, Qicheng
Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt-based methods for Large Language Models (LLMs) often rely on single-stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one-size-fits-all designs hinder the full exploitation of LLMs' reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi-agent collaboration into keyphrase extraction. MAPEX coordinates LLM-based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. A dual-path strategy dynamically adapts to document length: knowledge-driven extraction for short texts and topic-guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate MAPEX's strong generalization and universality: it outperforms the state-of-the-art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at https://github.com/NKU-LITI/MAPEX.
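A toy sketch of the dual-path routing described in this abstract; the threshold, the prompts, and the `llm` callable are all assumptions, and the real pipeline coordinates several specialized agents rather than a single function.

```python
# Sketch: route short texts to knowledge-driven extraction and long texts
# to topic-guided extraction, per MAPEX's dual-path strategy.
from typing import Callable, List

def extract_keyphrases(doc: str, llm: Callable[[str], str],
                       long_doc_tokens: int = 512) -> List[str]:
    if len(doc.split()) <= long_doc_tokens:
        # Short text: knowledge-driven extraction in one pass.
        prompt = ("You are a domain expert. Using relevant background "
                  f"knowledge, list the keyphrases of:\n{doc}")
    else:
        # Long text: first distill topics, then extract guided by them.
        topics = llm(f"Summarize the main topics of this document:\n{doc}")
        prompt = (f"Guided by these topics:\n{topics}\n"
                  f"Extract the keyphrases of:\n{doc}")
    raw = llm(prompt)
    # Stand-in for the post-processing agent: simple cleanup of the output.
    return [p.strip() for p in raw.split(",") if p.strip()]
```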
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Europe > Sweden > Uppsala County > Uppsala (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > China > Tianjin Province > Tianjin (0.04)
Long document summarization using page specific target text alignment and distilling page importance
Devi, Pushpa, Agrawal, Ayush, Dubey, Ashutosh, Chowdary, C. Ravindranath
The rapid growth of textual data across news, legal, medical, and scientific domains makes it challenging to access and understand large volumes of content, and increasingly difficult for users to extract meaningful information efficiently, raising the need for summarization. Unlike short document summarization, long document abstractive summarization is resource-intensive, and relatively little literature exists in this direction. BART is a widely used, efficient sequence-to-sequence (seq-to-seq) model; however, when it comes to summarizing long documents, the length of its context window limits its capabilities. We propose a model called PTS (Page-specific Target-text alignment Summarization) that extends the seq-to-seq method for abstractive summarization by dividing the source document into several pages. PTS aligns each page with the relevant part of the target summary for better supervision, and partial summaries are generated for each page of the document. We also propose PTSPI (Page-specific Target-text alignment Summarization with Page Importance), an extension of PTS in which an additional layer is placed before merging the partial summaries into the final summary. This layer provides dynamic page weighting and explicit supervision to focus on the most informative pages. In experiments on the benchmark dataset, PTSPI outperformed the SOTA by 6.32\% in ROUGE-1 and 8.08\% in ROUGE-2 scores.
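A simplified sketch of the two ideas, page-to-summary alignment (PTS) and importance-weighted merging (PTSPI). TF-IDF similarity and top-k selection stand in for the paper's learned alignment and weighting layers.

```python
# Sketch: assign each target-summary sentence to its most similar page to
# build per-page supervision (PTS), then merge partial summaries favoring
# the most informative pages (PTSPI-style weighting).
from typing import List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_pages(pages: List[str], summary_sents: List[str]) -> List[List[str]]:
    """Route every summary sentence to its most similar page."""
    vec = TfidfVectorizer().fit(pages + summary_sents)
    sim = cosine_similarity(vec.transform(summary_sents), vec.transform(pages))
    targets: List[List[str]] = [[] for _ in pages]
    for s_idx, page_idx in enumerate(sim.argmax(axis=1)):
        targets[page_idx].append(summary_sents[s_idx])
    return targets  # per-page target text for training the summarizer

def join_partials(partials: List[str], importance: List[float],
                  keep_ratio: float = 0.6) -> str:
    """Merge partial summaries, keeping those from the most important pages."""
    k = max(1, int(keep_ratio * len(partials)))
    keep = sorted(range(len(partials)), key=lambda i: importance[i],
                  reverse=True)[:k]
    return " ".join(partials[i] for i in sorted(keep))
```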
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- (10 more...)
Single-Pass Document Scanning for Question Answering
Cao, Weili, Wang, Jianyou, Zheng, Youze, Bao, Longtian, Zheng, Qirui, Berg-Kirkpatrick, Taylor, Paturi, Ramamohan, Bergen, Leon
Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at https://github.com/MambaRetriever/MambaRetriever
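The following toy sketch conveys the single-pass idea: one linear scan over the sentences, a running context instead of chunk boundaries, and a per-sentence relevance score conditioned on everything read so far. The scoring function is a placeholder, not the authors' learned scanner.

```python
# Sketch: stream sentences once, score each against the query given the
# full preceding context, and return the top-scoring sentences in order.
from typing import Callable, List, Tuple

def scan_document(sentences: List[str], query: str,
                  score: Callable[[str, str, str], float],
                  top_k: int = 5) -> List[str]:
    context = ""  # running representation of everything read so far
    scored: List[Tuple[float, int]] = []
    for i, sent in enumerate(sentences):
        # One pass per sentence: no chunk breaks, and the score may
        # depend on the entire preceding context.
        scored.append((score(query, context, sent), i))
        context = (context + " " + sent)[-2000:]  # bounded state, linear time
    best = sorted(scored, reverse=True)[:top_k]
    return [sentences[i] for _, i in sorted(best, key=lambda t: t[1])]

# Toy relevance score (query-term overlap); the paper uses a learned model.
def overlap(query: str, context: str, sent: str) -> float:
    q, s = set(query.lower().split()), set(sent.lower().split())
    return len(q & s) / (len(q) or 1)
```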
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (6 more...)
- Leisure & Entertainment (0.67)
- Education (0.46)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
Meghwani, Hansa, Agarwal, Amit, Pattnayak, Priyaranjan, Patel, Hitesh Laxmichand, Panda, Srikant
Enterprise search systems often struggle to retrieve accurate, domain-specific information due to semantic mismatches and overlapping terminologies. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored specifically for domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to enhance deployed re-ranking models. Our method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates substantial improvements of 15\% in MRR@3 and 19\% in MRR@10 compared to state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability and readiness for real-world applications.
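A condensed sketch of the mining loop this abstract describes: rank the corpus by similarity to the query, then keep the closest documents that are not marked relevant. Embedding choices, counts, and array shapes are illustrative assumptions.

```python
# Sketch: hard negatives are documents semantically close to the query
# but contextually irrelevant, i.e. high-similarity non-positives.
from typing import List, Set
import numpy as np

def mine_hard_negatives(query_emb: np.ndarray,      # shape (d,)
                        doc_embs: np.ndarray,       # shape (n_docs, d)
                        doc_ids: List[str],
                        relevant: Set[str],
                        n_negatives: int = 4) -> List[str]:
    # Cosine similarity of the query to every document.
    sims = doc_embs @ query_emb / (
        np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    negatives = []
    for idx in np.argsort(-sims):  # most similar first
        if doc_ids[idx] in relevant:
            continue  # skip true positives
        negatives.append(doc_ids[idx])  # close in meaning, wrong in fact
        if len(negatives) == n_negatives:
            break
    return negatives
```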
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Washington > King County > Seattle (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
The $\texttt{BIGBIRD-PEGASUS}$ model achieves $\textit{state-of-the-art}$ results on abstractive text summarization for long documents. However, its capacity is still limited to a maximum of $4,096$ tokens, which causes performance degradation on summarization of very long documents. A common way to deal with this issue is to truncate the documents. In this research, we take a different approach: we fine-tune the pretrained $\texttt{BIGBIRD-PEGASUS}$ model on a dataset from another domain. First, we filter out all documents whose length is less than $20,000$ tokens to focus on very long documents. To prevent domain shift and overfitting during transfer learning on a small dataset, we augment the dataset by splitting each document-summary training pair into parts, so that each document part fits within $4,096$ tokens. Source code is available at $\href{https://github.com/lhfazry/SPIN-summ}{https://github.com/lhfazry/SPIN-summ}$.
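A minimal sketch of the filter-and-split augmentation described above, assuming whitespace tokenization and a positional split of the summary; the paper's actual alignment of summary parts to document parts may differ.

```python
# Sketch: keep only very long documents, then cut each document-summary
# pair into parts that fit the 4,096-token context window.
from typing import Iterable, List, Tuple

MAX_TOKENS = 4096       # BIGBIRD-PEGASUS context limit
MIN_DOC_TOKENS = 20000  # "very long document" threshold from the abstract

def split_pair(doc: str, summary: str) -> List[Tuple[str, str]]:
    doc_toks, sum_toks = doc.split(), summary.split()
    n_parts = -(-len(doc_toks) // MAX_TOKENS)   # ceiling division
    sum_step = -(-len(sum_toks) // n_parts)     # positional summary split
    pairs = []
    for i in range(n_parts):
        d = " ".join(doc_toks[i * MAX_TOKENS:(i + 1) * MAX_TOKENS])
        s = " ".join(sum_toks[i * sum_step:(i + 1) * sum_step])
        pairs.append((d, s))
    return pairs

def build_training_set(corpus: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    out = []
    for doc, summary in corpus:
        if len(doc.split()) < MIN_DOC_TOKENS:
            continue  # keep only very long documents
        out.extend(split_pair(doc, summary))
    return out
```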
- Asia > Indonesia (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > Washington > King County > Seattle (0.04)
- (5 more...)
WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale
Li, Jiaxi, Zhang, Xingxing, Wang, Xun, Huang, Xiaolong, Dong, Li, Wang, Liang, Chen, Si-Qing, Lu, Wei, Wei, Furu
Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data synthesis methods focus narrowly on objectives like fact retrieval and summarization, restricting their generalizability to complex, real-world tasks. We introduce WildLong, which extracts meta-information from real user queries, models co-occurrence relationships via graph-based methods, and employs adaptive generation to produce scalable data. It extends beyond single-document tasks to support multi-document reasoning, such as cross-document comparison and aggregation. Our models, finetuned on 150K instruction-response pairs synthesized using WildLong, surpass existing open-source long-context-optimized models across benchmarks while maintaining strong performance on short-context tasks without incorporating supplementary short-context data. By generating a more diverse and realistic long-context instruction dataset, WildLong enhances LLMs' ability to generalize to complex, real-world reasoning over long contexts, establishing a new paradigm for long-context data synthesis.
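A small sketch of the graph step only, assuming meta-information has already been extracted as a tag set per real user query; the extraction and adaptive-generation stages are stubbed out, and plain adjacency sets keep the example self-contained.

```python
# Sketch: build a co-occurrence graph over query meta-information, then
# random-walk it to seed a new, realistic long-context instruction.
import itertools
import random
from collections import defaultdict
from typing import Dict, List, Set

def build_cooccurrence_graph(query_tags: List[Set[str]]) -> Dict[str, Set[str]]:
    graph: Dict[str, Set[str]] = defaultdict(set)
    for tags in query_tags:  # tags extracted from one real user query
        for a, b in itertools.combinations(sorted(tags), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def sample_instruction_seed(graph: Dict[str, Set[str]], length: int = 3) -> List[str]:
    """Random walk over co-occurring meta-information to seed generation."""
    seed = [random.choice(list(graph))]
    for _ in range(length - 1):
        neighbors = graph[seed[-1]] - set(seed)
        if not neighbors:
            break
        seed.append(random.choice(sorted(neighbors)))
    return seed  # e.g. ["multi-document", "comparison", "financial reports"]
```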
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (3 more...)